This course is all about finding the joy in doing science together. It’s about learning to use open data sources with tools that are available to everyone.
https://github.com/pauliinaanttila/IODS-project
This week we really got started with data wrangling and analysis in RStudio. I had not used RStudio before, so all of this has been quite exciting and challenging. After a few hours of trial and error, I'm starting to pick up the basics of exploring and analyzing data in RStudio.
students2014 <- read.csv(file = "~/Documents/GitHub/IODS-project/data/learning2014.csv", header = TRUE, sep = ",") # reading a CSV file named learning2014 into R
str(students2014) # exploring the structure of the data, gives the names and types of variables
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(students2014) # exploring the dimensions of the data, contains 166 observations of 7 variables
## [1] 166 7
We are working with the students2014 dataset this week. The dataset contains 166 students who took part in the course Introduction to Social Statistics in fall 2014 and also attended the final exam. The variables in the dataset are gender, age, attitude, deep learning, strategic learning, surface learning and exam points. Attitude towards statistics and the details concerning learning approaches were gathered through a questionnaire.
pairs(students2014[-1], col = students2014$gender) # draws all possible scatter plots from the columns of a data frame, resulting in a scatter plot matrix
This scatter plot matrix shows all the possible scatter plots from the columns of the students2014 data frame, but it’s a bit tricky to interpret as such.
summary(students2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
The dataset includes 110 females and 56 males. The mean age of the studied population is 25.5 years; the youngest participant was 17 years old and the oldest 55. The attitude scores vary between 1.4 and 5 with a mean of 3.14. The exam points vary between 7 and 33 with a mean of 22.7. The means for deep, strategic and surface learning are 3.7, 3.1 and 2.8, respectively.
library(GGally)
library(ggplot2)
p <- ggpairs(students2014, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
p
The age of the studied population is not normally distributed: the mass of the distribution is concentrated to the left of the figure, i.e. the skewness is positive. Put more simply, the majority of the participants are in their twenties, but there are a few noticeably older people as well.
The distributions of attitude, strategic learning and surface learning are fairly close to normal. The distributions of deep learning and points are concentrated to the right of the figure, i.e. the skewness is negative. In the case of points, there are more students who scored well in the examination than students who did poorly.
Male students seem to have slightly higher exam points than female students. There is a mild negative correlation between age and points. Attitude has a clear positive correlation with exam points, strategic learning correlates positively with points as well, and surface learning correlates negatively with them.
Regression analysis is a statistical modelling method used to estimate the relationships among variables. The focus is on the relationship between a dependent variable (here points) and one or more independent variables (here attitude, strategic learning and surface learning).
my_model <- lm(points ~ attitude + stra + surf, data = students2014) # a multiple regression model where points is the dependent variable and attitude, strategic learning and surface learning are explanatory variables
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
In this example, the multiple regression model has points as the dependent variable and attitude, strategic learning and surface learning as explanatory variables. The only variable with a statistically significant relationship with points is attitude: the coefficient is 3.4 (p-value 1.93e-08), meaning that a one-point increase in the attitude score is associated with, on average, 3.4 more exam points.
my_model2 <- lm(points ~ attitude, data = students2014)
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
It seems that a better attitude towards statistics is associated with better results in the exam, as measured by points.
R-squared is a statistical measure of how close the data are to the fitted regression line: the higher the R-squared, the more of the variation in the dependent variable the model explains. Its values lie between 0 and 1 (0-100 %). In general, a model fits the data well if the differences between the observed values and the model's predicted values are small and unbiased.
The R-squared is moderately low here (0.19), but that is understandable given the nature of the study: fields that attempt to predict human behavior typically see R-squared values below 50%, since humans are simply harder to predict than, say, physical processes.
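As a quick check on the definition, R-squared can be computed by hand from the residual and total sums of squares. The sketch below uses a small simulated data set (not students2014), so the variables and numbers are purely illustrative:

```r
# Simulated data: a linear relationship plus noise
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)   # residual sum of squares
ss_tot <- sum((y - mean(y))^2)    # total sum of squares
r2_manual <- 1 - ss_res / ss_tot

all.equal(r2_manual, summary(fit)$r.squared)
## [1] TRUE
```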
plot(my_model2, which = c(1,2,5))
A bit about diagnostic plots:
Diagnostic plots are used to test the validity of models.
1) The Residuals vs Fitted plot shows the residuals against the model's predictions and is used to check the constant variance assumption: the size of the errors should not depend on the explanatory variables.
In this example, the assumption seems reasonable, as the residuals are spread fairly randomly around zero.
2) The QQ-plot of the residuals provides a method to explore the assumption that the errors of the model are normally distributed.
The QQ-plot in this example is quite reasonable: the observations deviate from the straight line only at both ends. The normality assumption seems reasonable here.
3) Residuals versus leverage plot can help identify which observations have an unusually high impact.
In this example, there are no observations with markedly higher leverage than the rest.
This week we are working with a joined data set that combines two Portuguese student alcohol consumption data sets. The original data are available from the UCI Machine Learning Repository under the name “Student performance data set” (donated Nov 2014) and describe student performance in secondary education at two Portuguese schools in two distinct subjects: mathematics and Portuguese language. The data were collected using school reports and questionnaires. In the data wrangling we combined the two subject-specific data sets into one.
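A minimal sketch of the combining step, using base R's merge() on toy data. The data frames, the single join column `id` and the suffixes below are illustrative stand-ins, not the actual wrangling script (which joined on a set of background variables):

```r
# Toy versions of the two data sets; 'id' stands in for the real
# background variables used to identify the same student in both sets
math <- data.frame(id = c(1, 2, 3), G3 = c(10, 14, 8))
por  <- data.frame(id = c(2, 3, 4), G3 = c(12, 11, 15))

# Inner join: keep only the students present in both data sets
joined <- merge(math, por, by = "id", suffixes = c(".math", ".por"))
joined
##   id G3.math G3.por
## 1  2      14     12
## 2  3       8     11
```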
The wrangled data set contains 382 observations of 35 variables. The variables include the following:
1 school - student’s school
2 sex - student’s sex
3 age - student’s age
4 address - student’s home address type
5 famsize - family size
6 Pstatus - parent’s cohabitation status
7 Medu - mother’s education
8 Fedu - father’s education
9 Mjob - mother’s job
10 Fjob - father’s job
11 reason - reason to choose this school
12 guardian - student’s guardian
13 traveltime - home to school travel time
14 studytime - weekly study time
15 failures - number of past class failures
16 schoolsup - extra educational support
17 famsup - family educational support
18 paid - extra paid classes within the course subject (Math or Portuguese)
19 activities - extra-curricular activities
20 nursery - attended nursery school
21 higher - wants to take higher education
22 internet - Internet access at home
23 romantic - with a romantic relationship
24 famrel - quality of family relationships
25 freetime - free time after school
26 goout - going out with friends
27 Dalc - workday alcohol consumption
28 Walc - weekend alcohol consumption
29 health - current health status
30 absences - number of school absences
31 G1 - first period grade
32 G2 - second period grade
33 G3 - final grade
34 alc_use - the average of alcohol consumption on weekdays and weekends
35 high_use - TRUE if alc_use is higher than 2
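The last two variables were derived during the wrangling. A base-R sketch of that derivation on toy data (the actual script may have used dplyr instead):

```r
# Toy data with workday (Dalc) and weekend (Walc) consumption, scale 1-5
toy <- data.frame(Dalc = c(1, 2, 3), Walc = c(1, 3, 5))

toy$alc_use  <- (toy$Dalc + toy$Walc) / 2   # average consumption
toy$high_use <- toy$alc_use > 2             # TRUE if the average exceeds 2
toy
##   Dalc Walc alc_use high_use
## 1    1    1     1.0    FALSE
## 2    2    3     2.5     TRUE
## 3    3    5     4.0     TRUE
```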
alc <- read.csv(file = "~/Documents/GitHub/IODS-project/data/alc.csv", header = TRUE, sep = ",", dec = ".") # reading the data into R
colnames(alc) # printing out the names of the variables
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
dim(alc) # there are 382 observations of 35 variables in the data
## [1] 382 35
str(alc)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
My hypotheses about the relationships between alcohol consumption and the other variables:
1. Older students consume high amounts of alcohol more often than younger ones.
2. Those who have high use of alcohol tend to fail more often.
3. Those who have extra-curricular activities consume less alcohol.
4. Those who want to take higher education consume less alcohol.
1. Age and alcohol consumption:
library(ggplot2)
p1 <- ggplot(alc, aes(x = high_use, y = age))
p1 + geom_boxplot() + ylab("age")
t1 <- table(alc$high_use, alc$age) # crosstabulation of "high_use" and "age"
t1
##
## 15 16 17 18 19 20 22
## FALSE 63 79 64 54 7 1 0
## TRUE 18 28 36 27 4 0 1
According to the boxplots and crosstabulation, it seems that older students consume high amounts of alcohol more often than the younger ones.
2. Alcohol consumption and failures:
t2 <- table(alc$high_use, alc$failures) # crosstabulation of "high_use" and "failures"
t2
##
## 0 1 2 3
## FALSE 244 12 10 2
## TRUE 90 12 9 3
counts <- table(alc$high_use, alc$failures)
barplot(counts, main="High alcohol use and failures",
xlab="Failures",
legend = rownames(counts), beside=TRUE)
The number of failures is on the x-axis and the count of students on the y-axis, grouped by high_use (TRUE or FALSE).
Only a small proportion of students have failed, and the absolute counts of failures are fairly similar in the two groups (students with high alcohol use vs. those without). It should be noted, though, that there are far more students in the latter group (268 vs. 114).
3. Alcohol consumption and activities:
t3 <- table(alc$high_use, alc$activities)
t3
##
## no yes
## FALSE 122 146
## TRUE 59 55
counts <- table(alc$high_use, alc$activities)
barplot(counts, main="High alcohol use and activities",
xlab="Activities",
legend = rownames(counts), beside=TRUE)
The proportion of students who consume high amounts of alcohol is larger in the group with no extra-curricular activities.
4. Alcohol consumption and the urge to take higher education:
t4 <- table(alc$high_use, alc$higher)
t4
##
## no yes
## FALSE 9 259
## TRUE 9 105
counts <- table(alc$high_use, alc$higher)
barplot(counts, main="High alcohol use and the urge to take higher education",
xlab="The urge to take higher education",
legend = rownames(counts), beside=TRUE)
The great majority of the students want to take higher education. Among the few who do not (18 students), half are high users of alcohol, compared with roughly 30% of all students.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
my_model <- glm(formula = high_use ~ age + failures + activities + higher, data = alc, family = "binomial") # logistic regression model with high_use as the target variable and age, failures, activities and higher as explanatory variables
summary(my_model)
##
## Call:
## glm(formula = high_use ~ age + failures + activities + higher,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4581 -0.8444 -0.7406 1.3466 1.7584
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.1430 1.8387 -1.709 0.0874 .
## age 0.1528 0.1003 1.523 0.1279
## failures 0.4397 0.1948 2.257 0.0240 *
## activitiesyes -0.1820 0.2290 -0.795 0.4266
## higheryes -0.2731 0.5435 -0.503 0.6153
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 453.19 on 377 degrees of freedom
## AIC: 463.19
##
## Number of Fisher Scoring iterations: 4
coef(my_model) # coefficients of the model
## (Intercept) age failures activitiesyes higheryes
## -3.1429502 0.1527828 0.4397280 -0.1820300 -0.2731132
OR <- coef(my_model) %>% exp # computing odds ratios
CI <- confint(my_model) %>% exp # computing confidence intervals
## Waiting for profiling to be done...
cbind(OR,CI) # printing ORs and CIs
## OR 2.5 % 97.5 %
## (Intercept) 0.0431553 0.001136953 1.561687
## age 1.1650719 0.957724023 1.420525
## failures 1.5522849 1.059603697 2.289548
## activitiesyes 0.8335763 0.531757595 1.306573
## higheryes 0.7610067 0.262896907 2.277735
The only examined variable that is statistically significantly associated with high use of alcohol is failures. The odds ratio for failures is 1.55 with a confidence interval of 1.06-2.29 (p-value 0.024): each additional past class failure is associated with 1.55 times higher odds of high alcohol consumption.
The earlier data exploration and the logistic regression model show similar trends for the other variables, but those associations are not statistically significant.
my_model2 <- glm(high_use ~ failures, data = alc, family = "binomial") # fitting the model with the only variable (failures) that had statistically significant relationship with high use of alcohol
probabilities <- predict(my_model2, type = "response") # predicting the probability of high use
alc <- mutate(alc, probability = probabilities) # adding the predicted probabilities to 'alc'
alc <- mutate(alc, prediction = probability > 0.5) # using the probabilities to make a prediction of high_use
table(high_use = alc$high_use, prediction = alc$prediction) # tabulation of the target variable versus the predictions
## prediction
## high_use FALSE TRUE
## FALSE 256 12
## TRUE 102 12
library(dplyr)
library(ggplot2)
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction)) # initializing a plot of 'high_use' versus 'probability' in 'alc'
g + geom_point() # defining the geom as points and drawing the plot
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins() # tabulating the target variable versus the predictions
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67015707 0.03141361 0.70157068
## TRUE 0.26701571 0.03141361 0.29842932
## Sum 0.93717277 0.06282723 1.00000000
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
} # defining a loss function (mean prediction error)
loss_func(class = alc$high_use, prob = alc$probability) # calling loss_func to compute the proportion of wrong predictions in the (training) data
## [1] 0.2984293
The proportion of wrong predictions in the (training) data is 29.8%. This is better than a coin flip, but notably it equals the error of the trivial strategy of predicting FALSE for every student: 114 of the 382 students are high users, and 114/382 ≈ 29.8%. The model therefore barely improves on always guessing the majority class.
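The comparison to a trivial strategy can be sketched directly from the confusion matrix printed above; the counts are hard-coded here for illustration:

```r
# Counts from the cross-tabulation of high_use versus prediction
n_true  <- 102 + 12                 # students with high_use == TRUE
n_total <- 256 + 12 + 102 + 12      # all students

# Model's training error: 12 false positives + 102 false negatives
model_error <- (12 + 102) / n_total

# Trivial baseline: always predict FALSE -> every TRUE case is wrong
baseline_error <- n_true / n_total

c(model = model_error, baseline = baseline_error)
##     model  baseline
## 0.2984293 0.2984293
```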
This week we are working with the Boston data set (included in R, more precisely in the MASS package). The data deal with housing values in the suburbs of Boston and contain 506 observations of 14 variables.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data(Boston) # loading the Boston data
dim(Boston) # exploring the dimensions of the data
## [1] 506 14
str(Boston) # exploring the structure of the data
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
List of variables:
crim = per capita crime rate by town
zn = proportion of residential land zoned for lots over 25,000 sq.ft
indus = proportion of non-retail business acres per town
chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox = nitrogen oxides concentration (parts per 10 million)
rm = average number of rooms per dwelling
age = proportion of owner-occupied units built prior to 1940
dis = weighted mean of distances to five Boston employment centres
rad = index of accessibility to radial highways
tax = full-value property-tax rate per $10,000
ptratio = pupil-teacher ratio by town
black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat = lower status of the population (percent)
medv = median value of owner-occupied homes in $1000s
The sources of the data
Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.
Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.
library(corrplot)
## corrplot 0.84 loaded
library(tidyverse)
## ── Attaching packages ─────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble 1.3.4 ✔ purrr 0.2.4
## ✔ tidyr 0.7.2 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ MASS::select() masks dplyr::select()
cor_matrix<-cor(Boston) %>% round(2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
Above, you can see a correlation plot of the data. It seems, for example, that the index of accessibility to radial highways is strongly positively correlated with full-value property-tax rate. There seems to be a clear negative correlation between lower status of the population and median value of owner-occupied homes.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Above, you can see summaries of the variables in the data. To mention a few interesting ones:
- Per capita crime rate differs substantially between towns, with a minimum of 0.006 and a maximum of 89.0 (median 0.26).
- The pupil-teacher ratio varies between 12.6 and 22.0 (median 19.05).
- The share of lower-status population varies between 1.7% and 38.0% (median 11.4%).
boston_scaled <- scale(Boston) # standardizing the data set
summary(boston_scaled) # summary of the scaled data
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
In the scaling, each column's mean was subtracted from that column and the difference was divided by the column's standard deviation. In the scaled data you can see that the mean of every variable is now 0. Scaling puts the variables on a comparable scale, which is needed before the linear discriminant analysis and the distance-based methods used later.
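What scale() does can be verified manually on a toy vector (the numbers here are made up):

```r
x <- c(2, 4, 6, 8)

manual    <- (x - mean(x)) / sd(x)   # subtract the mean, divide by the standard deviation
via_scale <- as.vector(scale(x))

all.equal(manual, via_scale)
## [1] TRUE
round(mean(via_scale), 10)           # the mean of a scaled variable is 0
## [1] 0
```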
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low","med_low","med_high","high")) # creating a categorical variable crime with 4 categories (low, med_low, med_high and high)
boston_scaled <- dplyr::select(boston_scaled, -crim) # dropping the old 'crim' variable from the scaled data
boston_scaled <- data.frame(boston_scaled, crime) # adding the new categorical value to scaled data
n <- nrow(boston_scaled)
n # number of rows is 506
## [1] 506
ind <- sample(n, size = n * 0.8) # choosing randomly 80% of the rows
train <- boston_scaled[ind,] # creating the train set
test <- boston_scaled[-ind,] # creating the test set
library(MASS)
lda.fit <- lda(formula = crime ~ ., data = train) # fitting the LDA with the categorical crime rate as the target variable and all the other variables in the dataset as predictors
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes)
Above is the plot of the linear discriminant analysis, showing the first two linear discriminants with the observations colored by crime category.
crime_categories <- test$crime # saving the crime categories from the test set
test <- dplyr::select(test, -crime) # taking the 'crime' variable away from the test set
lda.pred <- predict(lda.fit, newdata = test) # predicting the classes with the LDA model on the test data
table(correct = crime_categories, predicted = lda.pred$class) # cross tabulating the results of prediction with the crime categories from the test set
## predicted
## correct low med_low med_high high
## low 13 9 2 0
## med_low 7 10 10 0
## med_high 2 9 18 0
## high 0 0 0 22
From the cross tabulation above we can see that the model predicts the correct crime category reasonably well; the high category in particular is classified perfectly.
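The overall accuracy can be computed from the cross tabulation. Below, the printed counts are hard-coded for illustration; they change with every random train/test split:

```r
# Confusion matrix from the cross-tabulation above (rows = correct class)
conf <- matrix(c(13,  9,  2,  0,
                  7, 10, 10,  0,
                  2,  9, 18,  0,
                  0,  0,  0, 22),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

accuracy <- sum(diag(conf)) / sum(conf)   # proportion classified correctly
accuracy
## [1] 0.6176471
```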
library(MASS)
data(Boston) # reloading the Boston data set
boston_scaled <- scale(Boston) # standardizing the data set
dist_eu <- dist(boston_scaled) # calculating the distances between the observations
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
Above, the Euclidean distances between the (scaled) observations are calculated and summarized.
km <- kmeans(Boston, centers = 4) # running the k-means algorithm on the dataset, with 4 chosen as the number of clusters
pairs(Boston, col = km$cluster)
This picture is too big and complicated to comprehend, so let’s take only part of it for closer examination.
pairs(Boston[1:5], col = km$cluster)
Now let’s try to find the optimal number of clusters. One way to determine it is to look at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes. When you plot the number of clusters against the total WCSS, the optimal number of clusters is where the total WCSS drops radically.
set.seed(123)
k_max <- 10 # determining the maximum number of clusters
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss}) # calculating the total within sum of squares
qplot(x = 1:k_max, y = twcss, geom = 'line') # visualizing the results
The optimal number of clusters seems to be 2. Let’s run the K-means algorithm again with this number of clusters:
km <- kmeans(Boston, centers = 2)
pairs(Boston, col = km$cluster)
pairs(Boston[1:5], col = km$cluster)